Outline

1. Problem Statement

Buying and selling used smartphones used to happen on only a handful of online marketplace sites. But the used and refurbished phone market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market will be worth $52.7bn by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used smartphones, which offer considerable savings compared with new models.

Refurbished and used devices continue to provide cost-effective alternatives to both consumers and businesses that are looking to save money when purchasing a smartphone. There are plenty of other benefits associated with the used smartphone market. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors/platforms, such as Verizon, Amazon, etc., provide attractive offers to customers for refurbished smartphones. Maximizing the longevity of mobile phones through second-hand trade also reduces their environmental impact and helps in recycling and reducing waste. The impact of the COVID-19 outbreak may further boost the cheaper refurbished smartphone segment, as consumers cut back on discretionary spending and buy phones only for immediate needs.

Objective

The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished smartphones. The goal is to analyze the data provided by ReCell (a startup aiming to tap the potential of this market) and build a linear regression model to predict the price of a used phone and identify the factors that significantly influence it.

2. Data Overview

Importing Necessary Libraries

Importing Data

Data Dictionary

The provided dataset contains the following columns:

  1. brand_name: Name of manufacturing brand
  2. os: OS on which the phone runs
  3. screen_size: Size of the screen in cm
  4. 4g: Whether 4G is available or not
  5. 5g: Whether 5G is available or not
  6. main_camera_mp: Resolution of the rear camera in megapixels
  7. selfie_camera_mp: Resolution of the front camera in megapixels
  8. int_memory: Amount of internal memory (ROM) in GB
  9. ram: Amount of RAM in GB
  10. battery: Energy capacity of the phone battery in mAh
  11. weight: Weight of the phone in grams
  12. release_year: Year when the phone model was released
  13. days_used: Number of days the used/refurbished phone has been used
  14. new_price: Price of a new phone of the same model in euros
  15. used_price: Price of the used/refurbished phone in euros

Let us take a look at the imported data and the summary of different columns:

Four of the columns represent categorical (qualitative) variables, namely:

Now we check for missing values in the data. The number of missing values in each column of the imported data is shown below:
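A minimal sketch of this check, with a small toy DataFrame standing in for the imported ReCell data (the real dataset is read from file):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the imported dataset
df = pd.DataFrame({
    "brand_name": ["Apple", "Samsung", "Nokia"],
    "main_camera_mp": [12.0, np.nan, 8.0],
    "used_price": [250.0, 180.0, 60.0],
})

# Number of missing values in each column
missing_counts = df.isnull().sum()
print(missing_counts)
```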

3. Exploratory Data Analysis (EDA)

What does the distribution of used phone prices look like?

A summary of all numerical features is shown below:

What percentage of the used phone market is dominated by Android devices?
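One way to answer this question is to compute each OS's share of the dataset; the sketch below uses a toy `os` column in place of the real one:

```python
import pandas as pd

# Toy 'os' column standing in for the real data
os_col = pd.Series(["Android", "Android", "iOS", "Android", "Others"])

# Share of each OS as a percentage of all phones in the data
os_share = os_col.value_counts(normalize=True) * 100
print(os_share)
```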

The amount of RAM is important for the smooth functioning of a phone. How does the amount of RAM vary with the brand?

A large battery often increases a phone's weight, making it feel uncomfortable in the hands. How does the weight vary for phones offering large batteries (more than 4500 mAh)?

A phone's weight and battery size appear to be positively correlated: overall, the weight of a phone increases as its battery size increases. The histogram of phone weight for phones with large batteries is shown below:

Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones are available across different brands with a screen size larger than 6 inches?

Let us sort the above plot based on the counts.

Samsung, Huawei, LG, and Lenovo, in that order, are the brands offering the most phones with big screens.

Budget phones nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of budget phones offering greater than 8MP selfie cameras across brands?

Which attributes are highly correlated with the used phone price?
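A correlation check of this kind can be sketched as below, on synthetic data where `used_price` is built from `new_price` (the real analysis runs the same computation on the full dataset):

```python
import pandas as pd
import numpy as np

# Synthetic data: used_price is constructed mostly from new_price
rng = np.random.default_rng(0)
new_price = rng.uniform(100, 800, size=50)
df = pd.DataFrame({
    "new_price": new_price,
    "screen_size": rng.uniform(10, 18, size=50),
    "used_price": 0.5 * new_price + rng.normal(0, 5, size=50),
})

# Correlation of every numerical feature with the target, strongest first
corr_with_target = df.corr()["used_price"].drop("used_price").sort_values(ascending=False)
print(corr_with_target)
```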

4. Data Preprocessing

4.1 Missing Value Treatment

The variables that have missing values, and the number of missing entries in each, are listed below:

There are no missing values in the target variable (used_price). However, six predictor variables have missing values and need treatment. All of these are quantitative variables of type 'float64'.

Let us fix the missing values.

For the six predictor variables, we will replace the missing values in each column with the column's median.
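The median imputation described above can be sketched as follows (two toy columns with missing entries stand in for the six real predictors):

```python
import pandas as pd
import numpy as np

# Toy columns with missing values
df = pd.DataFrame({
    "main_camera_mp": [12.0, np.nan, 8.0, 13.0],
    "battery": [3000.0, 4000.0, np.nan, 4500.0],
})

# Replace missing values in each numerical column with that column's median
for col in df.columns:
    df[col] = df[col].fillna(df[col].median())

print(df.isnull().sum().sum())
```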

We can see that all the missing values have been treated.

4.2 Duplicate value check

Now, we want to figure out whether we have duplicate data and how to deal with them.

As we can see, there are no duplicated rows in our DataFrame.
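The duplicate check can be sketched like this (a toy DataFrame stands in for ours):

```python
import pandas as pd

# Toy DataFrame with no repeated rows
df = pd.DataFrame({"brand_name": ["Apple", "Samsung"], "used_price": [250.0, 180.0]})

# Count fully duplicated rows; drop them if any are found
n_duplicates = df.duplicated().sum()
if n_duplicates > 0:
    df = df.drop_duplicates()
print(n_duplicates)
```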

4.3 Feature Engineering

Checking the levels of each categorical variable (feature), we can see that there are no values that look invalid, so we can proceed with our analysis.

All values seem to be valid.

Data Binning

4.4 Outlier detection and treatment

An outlier is a data point that is distant from other similar points. Outliers can distort predictions and hurt accuracy, and linear regression is particularly sensitive to them, so it is important to flag them for review.

Outlier detection using IQR

We use the IQR, the interval going from the 1st quartile to the 3rd quartile of the data in question, and flag points for investigation if they fall more than 1.5 * IQR beyond those quartiles. The following function calculates the fraction of outliers in each numerical column based on the IQR.
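A minimal version of such a function, tried on a toy column with one obvious outlier (the real function loops over all numerical columns of the dataset):

```python
import pandas as pd

def outlier_fraction(series):
    """Fraction of points outside the 1.5*IQR whiskers of a numerical column."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return ((series < lower) | (series > upper)).mean()

# Toy column with one obvious outlier (100)
weights = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])
print(outlier_fraction(weights))
```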

Let us plot the boxplots of all numerical columns to display outliers.

Outlier Treatment

We treat outliers in the data by flooring and capping as follows:
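A sketch of flooring and capping at the IQR whiskers, using `Series.clip` on the same kind of toy column:

```python
import pandas as pd

def treat_outliers(series):
    """Floor values below the lower whisker and cap values above the upper whisker."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series.clip(lower=lower, upper=upper)

# The outlier 100 gets capped at the upper whisker
weights = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])
treated = treat_outliers(weights)
print(treated.max())
```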

Let us look at the boxplots to see if the outliers have been treated or not.

4.5 EDA after Data Manipulation

It is a good idea to explore the data once again after manipulating it.

5. Building a Linear Regression Model

5.1 One-hot encoding categorical variables

We want to predict the used price.

Before we proceed to build a model, we'll have to encode categorical features.
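One common way to do this is `pd.get_dummies` with `drop_first=True`, which avoids perfect multicollinearity among the dummy columns; a sketch on a toy `os` column:

```python
import pandas as pd

# Toy data with one categorical column
df = pd.DataFrame({
    "os": ["Android", "iOS", "Others", "Android"],
    "used_price": [180.0, 250.0, 40.0, 120.0],
})

# Encode categorical columns as 0/1 dummies, dropping the first level of each
# category to avoid perfect multicollinearity
df_encoded = pd.get_dummies(df, columns=["os"], drop_first=True)
print(df_encoded.columns.tolist())
```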

Defining predictor (x) and target (y) variables

5.2 Splitting data into train and test datasets

We will split the data into train and test to be able to evaluate the model that we build on the train data.
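A sketch of a 70/30 split with sklearn's `train_test_split`, on synthetic arrays standing in for the real predictors and target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the predictor matrix and target vector
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3

# 70/30 split; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train.shape, X_test.shape)
```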

5.3 Fitting the regression model on the train data

We will build a Linear Regression model using the train data, and in the next section we will check its performance.

Let us check the coefficients and intercept of the model.
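The fit and its coefficients can be sketched as follows; noise-free toy data is used so the fitted coefficients match the generating ones:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free toy data: y = 2*x1 - 1*x2 + 3, so the fit should recover these values
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = 2 * X[:, 0] - 1 * X[:, 1] + 3

model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)
```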

5.4 Linear Regression using statsmodels

By default, add_constant() does not add a constant if the data already contains a column with variance = 0, i.e. a column with all identical values. In our dataset, all values in the 'ram' column equal 4. The solution is to pass the has_constant option to add_constant().

Observations

6. Model Performance Evaluation

Let us check the performance of the model using different metrics.

We will be using metric functions defined in sklearn for $RMSE$, $MAE$, and $R^2$ . We will define a function to calculate $MAPE$ and adjusted $R^2$.

The mean absolute percentage error (MAPE) measures the accuracy of predictions as a percentage. It is calculated as the average of the absolute differences between predicted and actual values, divided by the actual values. It works best if there are no extreme values in the data and none of the actual values are 0. We will create a function which will print out all the above metrics in one go.

6.1 Creating metric functions for performance evaluation
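Minimal sketches of the two hand-defined metrics (the function and argument names here are illustrative, not the report's exact ones):

```python
import numpy as np

def mape(targets, predictions):
    """Mean absolute percentage error (assumes no actual value is 0)."""
    targets, predictions = np.asarray(targets), np.asarray(predictions)
    return np.mean(np.abs((targets - predictions) / targets)) * 100

def adj_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Errors of 10% and 10% average to a MAPE of 10.0
print(mape([100, 200], [110, 180]))
```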

6.2 Evaluating model performance on the train and test data sets

Checking model performance on the train set (the 70% of the data seen during training)

Checking model performance on the test set (the unseen 30% of the data)

Observations

7. Checking Linear Regression Assumptions

In order to make statistical inferences from a linear regression model, it is important to ensure that the assumptions of linear regression are satisfied. We will be checking the following assumptions:

  1. No Multicollinearity

  2. Linearity of variables

  3. Independence of error terms

  4. Normality of error terms

  5. No Heteroscedasticity

7.1 Test for multicollinearity

Multicollinearity occurs when predictor variables in a regression model are correlated. This correlation is a problem because predictor variables should be independent. If the correlation between variables is high, it can cause problems when we fit the model and interpret the results. When we have multicollinearity in the linear model, the coefficients that the model suggests are unreliable.

There are different ways of detecting (or testing) multicollinearity. One such way is by using the Variance Inflation Factor, or VIF.

Using the following function, we calculate the VIF for each predictor variable.
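A sketch of such a function, computed directly from the definition VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing predictor i on all the others (the report may instead use statsmodels' variance_inflation_factor; the toy data below makes x2 an almost exact copy of x1, so both should show a large VIF):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(df):
    """VIF for each column: regress it on all other columns and use 1/(1 - R^2)."""
    vifs = {}
    for col in df.columns:
        others = df.drop(columns=col)
        r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs)

# Toy data: x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=100),
    "x3": rng.normal(size=100),
})
print(vif(df))
```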

The VIF values for dummy variables can be ignored. Among the quantitative variables, only release_year, with a VIF of 4.78187, is close to 5; the VIFs of all other quantitative variables are well below 5. We decide to remove the 'release_year' column since, as we learned in the EDA section, it has a high negative correlation with the 'days_used' feature.

Removing multicollinearity

To remove multicollinearity

  1. We will drop the 'release_year' column because it has the highest VIF score, close to 5.
  2. We will look at the adjusted R-squared and RMSE of the new model.
  3. Check the VIF scores again.
  4. Continue until all VIF values are small.

Let us define a function that will help us do this.

Now that we have dropped the release_year column, we will check the VIFs again.

All numerical and dummy features have VIFs well below 5; hence there is no multicollinearity and the assumption is satisfied.

Let us check the model performance.

Observations

The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and using a loop will be more efficient.

We extract the data corresponding to the selected features defined above.

We rebuild the regression model using statsmodels.

Now no feature has a p-value greater than 0.05, so we will consider the features in x_train3 as the final ones and olsmod2 as the final model.

Observations

7.2 Test for Linearity of variables and Independence

How to check linearity and independence?

How to fix if this assumption is not followed?

Let us create a dataframe with actual, fitted and residual values.

Let us plot the fitted values vs residuals

Let us check the scatter plots of the quantitative features in the x_train3 dataset versus the target variable to assess linearity. The plots below show both the original values of the quantitative predictor features and their log-transformed values against the target variable.

As these plots show, log-transforming most features does not improve the linearity of their relationship with the target variable. Since the linearity assumption of the regression model is not satisfied, we will later introduce a non-linear model to better predict the price of used phones. For now, we will proceed with the linear regression model.

7.3 Test for Normality

How to check normality?

How to fix if this assumption is not followed?

Since the p-value < 0.05, the residuals are not normal as per the Shapiro-Wilk test. Strictly speaking the assumption is violated; however, as an approximation, we can accept this distribution as close to normal.
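The Shapiro-Wilk test itself can be sketched with scipy, here run on synthetic residuals standing in for the model's actual ones:

```python
import numpy as np
from scipy import stats

# Synthetic residuals stand in for the model's actual residuals
rng = np.random.default_rng(5)
residuals = rng.normal(size=500)

# Shapiro-Wilk: the null hypothesis is that the sample is drawn from
# a normal distribution; p < 0.05 would reject normality
stat, p_value = stats.shapiro(residuals)
print(stat, p_value)
```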

7.4 Test for Homoscedasticity

The presence of non-constant variance in the error terms results in heteroscedasticity. Generally, non-constant variance arises in the presence of outliers.

How to check for homoscedasticity?

How to fix if this assumption is not followed?

Since p-value > 0.05, we can say that the residuals are homoscedastic. So, this assumption is satisfied.

We have checked all the assumptions of linear regression; apart from the linearity concerns noted above, they are satisfied.

8. Final Model-Summary

The features in x_train3 are considered the final train set, and olsmod2 is our final model.

Now, we can move towards the prediction part.

Note: As the number of records is large, for representation purpose, we are taking a sample of 25 records only.

Checking model performance on the train set (the 70% of the data seen during training)

Checking model performance on the test set (the unseen 30% of the data)

Observations

Let us compare the initial model created with sklearn and the final statsmodels model.

Let us recreate the final statsmodels model and print its summary to gain insights.

Hence, the degree-2 polynomial regression explains the variation in the data better than our proposed linear regression.
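A minimal sketch of such a degree-2 comparison, on synthetic data with a quadratic relationship (since the degree-2 feature set contains the linear one, the polynomial model's training R² can never be lower):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a genuinely quadratic relationship
rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(200, 1))
y = 1 + 2 * X[:, 0] + 1.5 * X[:, 0] ** 2 + rng.normal(size=200)

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

# Compare R^2 of the two fits
print(linear.score(X, y), poly.score(X, y))
```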